Rchemist - stringr을 이용한 문자 추출하기

아래와 같은 문자열 데이터가 있다고 가정해봅시다.

 [1] "                         location         age year  sales"
 [2] "1          South-East Asia Region   <20 years 2002 0.01"  
 [3] "2      Western Sub-Saharan Africa 20-24 years 2010 0.04"  
 [4] "3      Commonwealth Middle Income 40-44 years 2003 0.18"  
 [5] "4                  Eastern Europe 45-49 years 2008 0.37"  
 [6] "5  World Bank Lower Middle Income   <20 years 2005 0.01"  
 [7] "6                         Oceania 45-49 years 2006 0.26"  
 [8] "7      Commonwealth Middle Income 55-59 years 2004 0.30"  
 [9] "8      Western Sub-Saharan Africa 30-34 years 1997 0.04"  
[10] "9        High-income Asia Pacific 65-74 years 2000 0.24"

비록 문자 형태의 벡터이지만, 열 단위로 묶인 데이터프레임의 형태를 띄고 있습니다.

저 문자열 속에서 location, age, year, sales 를 나눠 데이터프레임으로 만들어주려면 (~~일일이 복붙하면서 데이터프레임으로 선언하는 노가다를 해야 합니다)~~ 정규표현식(Regular expression)을 사용해야 합니다.

정규표현식 규칙은 크게 다음과 같습니다.

숫자: \\d 또는 [:digit:]
문자: [:alpha:]
숫자 또는 문자: [:alnum:]
0개 이상: *
한 개 이상: +
시작하는 단어: ^
끝나는 단어: $
구두점 등의 특수문자: [:punct:] 또는 [:symbol:] 또는 \\특수기호

R에서 특수문자 정규표현식을 찾을 때, 사용될 특수문자에 따라 [:punct:] 또는 [:symbol:]을 사용해야 합니다.
```
library(stringi)
ascii <- stri_enc_fromutf32(1:127)
message("Punct: ", stri_extract_all_regex(ascii, "[[:punct:]]")[[1]])
message("Symbol: ", stri_extract_all_regex(ascii, "[[:symbol:]]")[[1]])
```

여러 가지 규칙을 한 번에 찾을 때, 규칙이 긴 것부터 찾습니다. 다시 말해 찾고자 하는 단어가 긴 단어의 규칙을 먼저 입력해야 합니다.

여러 개의 패턴을 적용할 때 | 을 단위로 정규표현식을 찾아줄 수가 있습니다. paste() 또는 paste0() 함수의 collapse를 이용하여 여러 패턴을 붙여줄 수 있습니다.

자 이제 처음에 보여드렸던 문자 데이터에서 각각의 데이터를 추출하여 데이터프레임으로 만들어보겠습니다. 가장 먼저 location 입니다.

지역이 굉장히 다양하고, 또 단어마다 규칙들이 많아 찾기가 어려운 규칙입니다.

library(stringr)
f <- "[:alpha:]+ [:alpha:]+ [:alpha:]+ [:alpha:]+ [:alpha:]+"
g <- "[:alpha:]+-[:alpha:]+ [:alpha:]+ - [:alpha:]+"
d <- "[:alpha:]+-[:alpha:]+ [:alpha:]+ [:alpha:]+"
h <- "[:alpha:]+ [:alpha:]+-[:alpha:]+ [:alpha:]+"
a <-  "[:alpha:]+ [:alpha:]+ [:alpha:]+"
c <- "[:alpha:]+ [:alpha:]+-[:alpha:]+"
e <- "[:alpha:]+ [:alpha:]+"
b <-  "[:alpha:]+"
ptrn_loc <- paste0(c(f,g,d,h,a,c,e,b),collapse = "|")

loc <- str_extract(
  string = temp[2:length(temp)],
  pattern = ptrn_loc)

head(loc)

[1] "South-East Asia Region"         "Western Sub-Saharan Africa"    
[3] "Commonwealth Middle Income"     "Eastern Europe"                
[5] "World Bank Lower Middle Income" "Oceania"

다음은 age를 찾아보겠습니다. age를 살펴보면, 네 가지 정도로 분류할 수 있습니다. 50-54 형태, 80+ 형태 <20 형태, 그리고 All ages가 있죠.

aa <- "[:digit:]{2}-[:digit:]{2} years"
bb <- "\\<[:digit:]{2}+ years"
cc <- "[:digit:]{2}\\+ years"
dd <- "All ages" # all ages
ptrn_age <- paste0(c(aa,bb,cc,dd),collapse = "|")
ages <- str_extract(
  string = temp[2:length(temp)],
  pattern = ptrn_age
)

head(ages)

[1] "<20 years"   "20-24 years" "40-44 years" "45-49 years" "<20 years"  
[6] "45-49 years"

나머지 year와 sales는 앞의 두 가지보다 간단합니다.

year <- str_extract(
  string = temp[2:length(temp)],
  pattern = "[:digit:]{4}"
) |> as.numeric()

head(year)

[1] 2002 2010 2003 2008 2005 2006

sales <- str_extract(
  string = temp[2:length(temp)],
  pattern = "[:digit:]{1}\\.[:digit:]{2}"
) |> as.numeric()

head(sales)

[1] 0.01 0.04 0.18 0.37 0.01 0.26

이제 위에서 추출한 loc, ages, year, sales를 하나의 데이터프레임으로 만들어주면 끝입니다.

df <- data.frame(
  location = loc,
  age = ages,
  year = year,
  sales = sales
)

head(df)

레퍼런스

https://continuous-development.tistory.com/33
https://stackoverflow.com/questions/26348643/r-regex-with-stringi-icu-why-is-a-considered-a-non-punct-character